To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Authors

Abstract

Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counterattack this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).
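To make the character-level category concrete, the sketch below shows one possible form of character swapping for a tagged sentence. It is an illustration only, not the paper's implementation: the function names `swap_adjacent_chars` and `augment_sentence`, the `prob` parameter, and the choice to protect the first and last character of each token are all assumptions made for this example.

```python
import random


def swap_adjacent_chars(token, rng):
    """Swap one random pair of adjacent inner characters.

    Tokens shorter than 4 characters are returned unchanged, and the
    first and last characters are never moved (an assumption of this sketch).
    """
    if len(token) < 4:
        return token
    # i ranges over 1 .. len(token) - 3, so i + 1 is never the last index.
    i = rng.randrange(1, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_sentence(tokens, tags, prob=0.3, seed=0):
    """Return a noisy copy of a tagged sentence.

    Each token is perturbed with probability `prob`. The number of tokens
    does not change, so the per-token tags can be reused as they are.
    """
    rng = random.Random(seed)
    new_tokens = [
        swap_adjacent_chars(tok, rng) if rng.random() < prob else tok
        for tok in tokens
    ]
    return new_tokens, list(tags)


if __name__ == "__main__":
    tokens = ["Augmentation", "improves", "dependency", "parsing"]
    tags = ["NOUN", "VERB", "NOUN", "NOUN"]
    print(augment_sentence(tokens, tags, prob=1.0, seed=1))
```

Because the swap keeps the token count fixed, token-level labels such as part-of-speech tags stay aligned with the perturbed tokens, which is one reason this kind of noise is easy to apply to sequence tagging data. Token-level and syntax-level operations such as word insertion or sub-sentence cropping change the sequence length and would also need to update the labels.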


Related resources

To Augment or Not to Augment: Solving Unsplittable Flow on a Path by Creating Slack

In the Unsplittable Flow on a Path problem (UFP) we are given a path with non-negative edge capacities and a set of tasks, each one characterized by a subpath, a demand, and a profit. Our goal is to select a subset of tasks of maximum total profit so that the total demand of the selected tasks on each edge does not exceed the respective edge capacity. UFP naturally captures several applications i...


Storage required to augment low flows: a regional study

An attempt has been made to provide a basis on which the amounts of deficit storage occurring in the natural hydrographs of small catchments can be estimated as a function of yield and the frequency of occurrence. The yield levels envisaged constitute only small proportions of the mean flow and the deficits arise due to seasonal fluctuations only. It is presumed that any deficits built up durin...


Using Dependency Parses to Augment Feature Construction for Text Mining

With the prevalence of large data stored in the cloud, including unstructured information in the form of text, there is now an increased emphasis on text mining. A broad range of techniques are now used for text mining, including algorithms adapted from machine learning, NLP, computational linguistics, and data mining. Applications are also multi-fold, including classification, clust...


Applying Transitivity Theory to Gender Analysis of EFL Textbook: A Comparative Study

EFL/ESL textbooks have been regarded as essential language teaching materials with which learners spend about 70 to 90 percent of their class time. The important role they play and their vast use make them influential not only in learning the language but also in shaping values and attitudes. To put it another way, textbooks socialize learners using their contents (i.e. texts, illustrations...


Journal

Journal title: Computational Linguistics

Year: 2022

ISSN: 1530-9312, 0891-2017

DOI: https://doi.org/10.1162/coli_a_00425